62 research outputs found

    Promoting user engagement and learning in amorphous search tasks

    Much research in information retrieval (IR) focuses on optimization of the rank of relevant retrieval results for single-shot ad hoc IR tasks. Relatively little research has been carried out on user engagement to support more complex search tasks. We seek to improve user engagement in IR tasks by providing richer representations of retrieved information. We expect that this strategy will promote implicit learning within search activities. Specifically, we plan to explore methods of finding semantic concepts within retrieved documents, with the objective of creating improved document surrogates. Further, we would like to study search effectiveness in terms of different facets such as the user’s search experience, satisfaction, engagement and learning. We intend to investigate this in an experimental study in which our richer document representations are compared with traditional document surrogates for the same user queries.

    HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces

    Nearest neighbor searching of large databases in high-dimensional spaces is inherently difficult due to the curse of dimensionality. A flavor of approximation is, therefore, necessary to practically solve the problem of nearest neighbor search. In this paper, we propose a novel yet simple indexing scheme, HD-Index, to solve the problem of approximate k-nearest neighbor queries in massive high-dimensional databases. HD-Index consists of a set of novel hierarchical structures called RDB-trees built on Hilbert keys of database objects. The leaves of the RDB-trees store distances of database objects to reference objects, thereby allowing efficient pruning using distance filters. In addition to the triangle inequality, we also use the Ptolemaic inequality to produce better lower bounds. Experiments on massive (up to billion scale) high-dimensional (up to 1000+ dimensions) datasets show that HD-Index is effective, efficient, and scalable. Comment: PVLDB 11(8):906-919, 201
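The distance-filter pruning described above can be illustrated with the two lower bounds the abstract names. A minimal sketch (function names and the toy distances are ours, not from HD-Index): for a query q, candidate object o and reference objects r, any object whose lower bound already exceeds the current k-th best distance can be skipped without computing its true high-dimensional distance.

```python
def triangle_lb(d_qr, d_or):
    """Triangle inequality: d(q, o) >= |d(q, r) - d(o, r)| for any reference r."""
    return abs(d_qr - d_or)

def ptolemy_lb(d_qr1, d_or1, d_qr2, d_or2, d_r1r2):
    """Ptolemaic inequality with two references r1, r2:
    d(q, o) >= |d(q, r1) * d(o, r2) - d(q, r2) * d(o, r1)| / d(r1, r2)."""
    return abs(d_qr1 * d_or2 - d_qr2 * d_or1) / d_r1r2

def can_prune(lower_bound, kth_best_dist):
    """Skip an object whose lower bound exceeds the current k-th nearest distance."""
    return lower_bound > kth_best_dist
```

On a toy 1-D example (q=0, o=5, r1=2, r2=7) the triangle bound gives 1.0 while the Ptolemaic bound gives 3.4, illustrating why a second, often tighter, filter helps pruning.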

    DCU-SEManiacs at SemEval-2016 task 1: synthetic paragram embeddings for semantic textual similarity

    We experiment with learning word representations designed to be combined into sentence-level semantic representations, using an objective function which does not directly make use of the supervised scores provided with the training data, instead opting for a simpler objective which encourages similar phrases to be close together in the embedding space. This simple objective lets us start with high-quality embeddings trained using the Paraphrase Database (PPDB) (Wieting et al., 2015; Ganitkevitch et al., 2013), and then tune these embeddings using the official STS task training data, as well as synthetic paraphrases for each test dataset, obtained by pivoting through machine translation. Our submissions include runs which only compare the similarity of phrases in the embedding space, directly using the similarity score to produce predictions, as well as a run which uses vector similarity in addition to a suite of features we investigated for our 2015 SemEval submission. For the cross-lingual task, we simply translate the Spanish sentences to English and use the same system we designed for the monolingual task.
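The simplest of the runs described above scores a sentence pair directly by similarity in the embedding space. A minimal sketch of that idea with toy vectors (the table `emb` stands in for the Paragram embeddings; averaging word vectors and comparing by cosine is the standard combination, not necessarily the exact submitted system):

```python
import math

def sentence_vec(tokens, emb):
    """Average the word vectors of the tokens found in the embedding table."""
    vecs = [emb[t] for t in tokens if t in emb]
    dim = len(next(iter(emb.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy embedding table standing in for Paragram vectors.
emb = {"cat": [1.0, 0.0], "feline": [0.9, 0.1], "car": [0.0, 1.0]}
```

With these toy vectors, "cat" scores higher against "feline" than against "car", which is exactly the behaviour the training objective (similar phrases close together) is meant to produce.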

    Generalized Zero-Shot Learning via Synthesized Examples

    We present a generative framework for generalized zero-shot learning where the training and test classes are not necessarily disjoint. Built upon a variational autoencoder based architecture, consisting of a probabilistic encoder and a probabilistic conditional decoder, our model can generate novel exemplars from seen/unseen classes, given their respective class attributes. These exemplars can subsequently be used to train any off-the-shelf classification model. One of the key aspects of our encoder-decoder architecture is a feedback-driven mechanism in which a discriminator (a multivariate regressor) learns to map the generated exemplars to the corresponding class attribute vectors, leading to an improved generator. Our model's ability to generate and leverage examples from unseen classes to train the classification model naturally helps to mitigate the bias towards predicting seen classes in generalized zero-shot learning settings. Through a comprehensive set of experiments, we show that our model outperforms several state-of-the-art methods on several benchmark datasets, for both standard and generalized zero-shot learning. Comment: Accepted in CVPR'1
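The overall pipeline, synthesizing exemplars for classes from their attribute vectors and then training an off-the-shelf classifier on them, can be sketched as follows. The generator here is a deliberately crude stub (attribute vector plus Gaussian noise) standing in for the paper's conditional VAE decoder, and the class names and attribute vectors are invented for illustration:

```python
import random

def synthesize(attr, n, noise=0.1, rng=None):
    """Stub generator: exemplar = class attribute vector + small Gaussian noise.
    (The paper uses a learned conditional decoder here; this is only a stand-in.)"""
    rng = rng or random.Random(0)
    return [[a + rng.gauss(0, noise) for a in attr] for _ in range(n)]

def nearest_class(x, centroids):
    """Stand-in off-the-shelf classifier: assign x to the nearest class centroid."""
    def d2(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return min(centroids, key=lambda c: d2(x, centroids[c]))

# Seen classes plus an unseen one: no real 'zebra' examples are ever observed,
# but synthesized exemplars let the classifier cover it anyway.
attrs = {"horse": [1.0, 0.0], "tiger": [0.0, 1.0], "zebra": [0.7, 0.7]}
train = {c: synthesize(a, 20, rng=random.Random(42)) for c, a in attrs.items()}
centroids = {c: [sum(col) / len(col) for col in zip(*xs)] for c, xs in train.items()}
```

The point of the sketch is the data flow, not the model: because the unseen class contributes synthetic training examples, the classifier is no longer biased towards predicting only seen classes.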

    Position paper: promoting user engagement and learning in search tasks by effective document representation

    Much research in information retrieval (IR) focuses on optimization of the rank of relevant retrieval results for single-shot ad hoc IR tasks with straightforward information needs. Relatively little research has been carried out to study and support user learning and engagement for more complex search tasks. We introduce an approach intended to improve a user's topical knowledge while undertaking IR tasks. Specifically, we propose to explore methods of finding useful and informative textual units (semantic concepts) within retrieved documents, with the objective of creating improved document surrogates for presentation within the search process. We hypothesize that this strategy will promote improved implicit learning within search activities. We believe that the richer document representations proposed in the paper would help to promote engagement, understanding and learning as compared to more traditional search engine document snippets. We propose a framework for holistic evaluation of our proposed document representations and their use in search.

    How do users perceive information: analyzing user feedback while annotating textual units

    We describe an initial study of how participants perceive information when they categorize highlighted textual units within a document marked for a given information need. Our investigation explores how users look at different parts of a document and classify textual units within retrieved documents on a 4-level scale of relevance and importance. We compare how users classify different textual units within a document, and report the mean and variance across different users and topics. Further, we analyze and categorize the reasons provided by users while rating textual units within retrieved documents. This research yields some interesting observations regarding why some parts of a document are regarded as more relevant than others (e.g. they provide contextual or background information) and what kind of information seems effective for satisfying end users in a search task (e.g. showing examples, providing facts). This work is part of our ongoing investigation into the generation of effective surrogates and document summaries based on search topics and user interactions with information.

    Promoting user engagement and learning in search tasks by effective document representation

    Much research in information retrieval (IR) focuses on optimisation of the rank of relevant retrieval results for single shot ad hoc IR tasks. Relatively little research has been carried out on supporting and promoting user engagement within search tasks. We seek to improve user experience by use of enhanced document snippets to be presented during the search process to promote user engagement with retrieved information. The primary role of document snippets within search has traditionally been to indicate the potential relevance of retrieved items to the user’s information need. Beyond the relevance of an item, it is generally not possible to infer the contents of individual ranked results just by reading the current snippets. We hypothesise that the creation of richer document snippets and summaries, and effective presentation of this information to users will promote effective search and greater user engagement, and support emerging areas such as learning through search. We generate document summaries for a given query by extracting top relevant sentences from retrieved documents. Creation of these summaries goes beyond existing snippet creation methods by comparing content between documents to take into account novelty when selecting content for inclusion in individual document summaries. Further, we investigate the readability of the generated summaries with the overall goal of generating snippets which not only help a user to identify document relevance, but are also designed to increase the user’s understanding and knowledge of a topic gained while inspecting the snippets. We perform a task-based user study to record the user’s interactions, search behaviour and feedback to evaluate the effectiveness of our snippets using qualitative and quantitative measures. In our user study, we found that richer snippets generated in this work improved the user experience and topical knowledge, and helped users to learn about the topic effectively.
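The novelty-aware sentence selection described above can be sketched as a greedy loop that rewards query relevance and penalizes overlap with sentences already chosen, in the spirit of maximal marginal relevance. The word-overlap scoring below is a simplification of ours, not the thesis's exact method:

```python
def overlap(a, b):
    """Jaccard word-overlap similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def build_summary(query, sentences, k=2, lam=0.5):
    """Greedily pick k sentences: high query relevance, low redundancy
    against sentences already selected (lam trades the two off)."""
    chosen = []
    pool = list(sentences)
    while pool and len(chosen) < k:
        def score(s):
            rel = overlap(query, s)
            red = max((overlap(s, c) for c in chosen), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(pool, key=score)
        chosen.append(best)
        pool.remove(best)
    return chosen
```

The redundancy term is what distinguishes this from plain snippet extraction: a near-duplicate of an already-selected sentence is skipped in favour of novel content, even if it is slightly more query-relevant.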

    DCU at FIRE 2013: cross-language !ndian news story search

    We present an overview of our work carried out for DCU’s participation in the Cross Language !ndian News Story Search (CL!NSS) task at FIRE 2013. Our team submitted 3 main runs and 2 additional runs for this task. Our approach consisted of 2 steps: (1) the Lucene search engine was used with varied input query formulations, using different features and heuristics designed to identify as many relevant documents as possible to improve recall; (2) document list merging and re-ranking was performed with the incorporation of a date feature. The results of our best run were ranked first among official submissions based on NDCG@5 and NDCG@10 values, and second based on NDCG@1 values. For the 25 test queries, the results of our best main run were NDCG@1 0.7400, NDCG@5 0.6809 and NDCG@10 0.7268.
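Step (2), merging the ranked lists produced by the different query formulations and re-ranking with a date feature, can be sketched as follows. The score normalization and date-decay boost are illustrative assumptions of ours, not the exact CL!NSS submission:

```python
from datetime import date

def merge_and_rerank(run_lists, doc_dates, query_date, alpha=0.8):
    """Merge several (doc_id, score) lists by summing per-run normalized
    scores, then boost documents published close to the query story's date."""
    merged = {}
    for run in run_lists:
        top = max(score for _, score in run)
        for doc_id, score in run:
            merged[doc_id] = merged.get(doc_id, 0.0) + score / top
    def final(doc_id):
        gap_days = abs((doc_dates[doc_id] - query_date).days)
        date_boost = 1.0 / (1.0 + gap_days)  # decays with temporal distance
        return alpha * merged[doc_id] + (1 - alpha) * date_boost
    return sorted(merged, key=final, reverse=True)
```

A document retrieved by several query formulations accumulates score from each list, and a small publication-date proximity bonus breaks ties in favour of temporally plausible source stories.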

    The good, the bad and their kins: identifying questions with negative scores in StackOverflow

    A rapid increase in the number of questions posted on community question answering (CQA) forums is creating a need for automated methods of question quality moderation to improve the effectiveness of such forums in terms of response time and quality. Such automated approaches should aim to classify questions as good or bad for a particular forum as soon as they are posted, based on the guidelines and quality standards defined by the forum. Thus, if a question meets the standards of the forum it is classified as good; otherwise it is classified as bad. In this paper, we propose a method to address this question classification problem by retrieving similar questions previously asked in the same forum, and then using the text from these previously asked similar questions to predict the quality of the current question. We empirically validate our proposed approach on StackOverflow data, a massive CQA forum for programmers comprising about 8M questions. Using this additional text retrieved from similar questions, we are able to improve question quality prediction accuracy by about 2.8% and improve the recall of negatively scored questions by about 4.2%. This improvement of 4.2% in recall would be helpful in automatically flagging questions as bad (unsuitable) for the forum, and would speed up the moderation process, saving time and human effort.
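The core idea, retrieving similar previously asked questions and using their text to judge a new question, can be sketched with a toy archive. Word-overlap retrieval and a neighbour label vote stand in here for the paper's actual retrieval model and trained classifier:

```python
def tokens(text):
    """Lowercased word set of a question's text."""
    return set(text.lower().split())

def augment(question, archive, k=2):
    """Retrieve the k archived questions with the most word overlap and
    append their text to the new question's feature text."""
    ranked = sorted(archive, key=lambda q: -len(tokens(question) & tokens(q["text"])))
    neighbours = ranked[:k]
    combined = question + " " + " ".join(q["text"] for q in neighbours)
    return combined, neighbours

def predict_quality(question, archive, k=2):
    """Label a new question by the majority label of its retrieved neighbours
    (a stand-in for the classifier trained on the augmented text)."""
    _, neighbours = augment(question, archive, k)
    votes = sum(1 if q["label"] == "good" else -1 for q in neighbours)
    return "good" if votes >= 0 else "bad"
```

The sketch captures the moderation intuition: a newly posted question that reads like previously well-received questions is likely fine, while one resembling past negatively scored questions can be flagged for review early.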